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.^.j , Abstract: This paper proposes a new method and algorithm for predicting 

C^ ■ multivariate responses in a regression setting. Research into classification 

of High Dimension Low Sample Size (HDLSS) data, in particular microar- 

ray data, has made considerable advances, but regression prediction for 

high-dimensional data with continuous responses has had less attention. 

_ , Recently Bair et al (2006) proposed an efficient prediction method based 

*>— »' ■ on supervised principal component regression (PCR). Motivated by the fact 

_ that a larger number of principal components results in better regression 

(_^ ' performance, this paper extends the method of Bair et al in several ways: 

nJ ' a comprehensive variable ranking is combined with a selection of the best 

number of components for PCR, and the new method further extends to re- 
gression with multivariate responses. The new method is particularly suited 
to HDLSS problems. Applications to simulated and real data demonstrate 
the performance of the new method. Comparisons with Bair et al (2006) 
show that for high-dimensional data in particular the new ranking results 
in a smaller number of predictors and smaller errors. 



AMS 2000 subject classifications: Primary 62J99; secondary 62H99. 
, ; , Keywords and phrases: Dimension Selection, Multivariate Regression, 

frt ' Multivariate Responses, Principal Component Regression, Variable Rank- 

ing, Variable Selection. 



1. Introduction 

Classification of high-dimensional data - motivated mainly by the importance 
of tumor classifications for high-dimensional microarray data - has attracted 
much recent attention. To date less attention has been paid to the prediction 
of survival times for gene expression data, although such prediction can pro- 
vide valuable additional knowledge and important information in the selection 
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of relevant genes. One important advance in this area is Bair et al (2006) who 
proposed a prediction method for multiple regression based on supervised princi- 
pal components. They proposed not only a new method, but also an underlying 
model based on latent variables which provides a good and theoretically founded 
explanation of their method. 

In this paper we focus on prediction of continuous response variables in a 
very general linear regression framework and extend the research of Bair et 
al (2006) in a number of crucial ways. Like Bair et al we allow a very large 
number of predictors, but in addition we consider multivariate reponses as met 
in practice, for example, in the prediction of treatment placement and facilities 
in drug related offences. Including multivariate responses in our model is a 
genuine extension over Bair et al whose approach cannot handle such data. 

A central problem in linear regression from multivariate predictors is the 
choice of variables that are included in the prediction model. Section 2 briefly 
reviews the classical case and principal component regression (PCR). For high- 
dimensional data PCR has an inuitive appeal for regression prediction, and we 
follow that path - similar to Bair et al (2006). But unlike their approach, which 
uses the first principal component only, wc integrate the dimension selection of 
Koch and Naito (2007) into PCR, which yields a better fit based on more than 
one principal component. 

Compared to Bair et al, our prediction model is more complex and extensive; 
our choice of dimension has a theoretical foundation, and since we use more 
than one predictor, our prediction model results in more accurate prediction. 

Another prediction method with applications in chemometrics has been pro- 
posed in Gustafsson (2005). This approach also falls into the PCR framework 
but then uses a further rotation of the sphered principal components. For re- 
gression predictions such rotations have no effect, since they would cancel out 
in an appropriate verson of (5) . For this reason we compare our results to those 
of Bair et al (2006). 

A key issue in classification and prediction of a continuous response variable 
from high-dimensional data is the preselection of a moderate number of variables 
that have strong predictive power. In microarray analysis this first selection is 
often achieved by univariate t-tests. Bair et al (2006) replace these tests, essen- 
tially by calculating the univariate correlation coefficient between each predictor 
and the response variable, and make a preselection of the variables based on the 
absolute value of these correlation coefficients. This preselection is simple to im- 
plement and understand, but does not capture the interaction of the variables 
or genes. We address the preselection problem and propose a ranking of the 
variables which takes into account all predictors simultaneously, and thus takes 
care of valuable interaction between the variables in the preselection. 

Our prediction method can be regarded as a two-step variable selection. The 
first step selects the variables that are most important for prediction, while the 
second step summarises these variables in a smaller, and judiciously chosen, 
number of components. 

The paper is organised as follows. Section 2 reviews regression prediction, and 
describes variable ranking based on a canonical correlation approach. Section 
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3 proposes our regression algorithm which is based on variable ranking and di- 
mension selection. Section 4 provides more details and properties of our variable 
selection, and includes comparisons with the method of Bair et al (2006). Sec- 
tion 5 shows how our method performs in practice for simulated and real data, 
including an example of high-dimensional microarray data, and gives compar- 
isons with the approach of Bair et al (2006). We conclude with a discussion of 
our method and algorithm in Section 6. 

2. Prediction and Canonical Correlations 
2.1. Regression Prediction 

Wc consider the multivariate regression setting: 

Y = XB + E, (1) 

where Y = [yi • • • yAr]^ denotes the matrix of responses, and, each response y,; is 
a vector of length q. Further, B = [(3i ■ ■ ■ I3J\^ is the px q matrix of coefficients, 
and E is the N x q error matrix. The design matrix is the same as in a multiple 
regression setting: X = [xi • • • Xp] , so each variable or feature vector Xj is of 
length TV. Throughout this paper we will assume that the columns of X are 
centred. An estimator for B is given by 

B = (X'^X)-iX'^Y, (2) 

where X is derived from X, and a new g-dimensional response y is predicted 
for a single p-dimensional datum xq by 

y^=xJ(X^X)-iX^Y. (3) 

li p < N and (X^X)"'^ exists, then X = X leads to an estimator and new 
predictions based on all variables. If the inverse (X-'^X)"^ does not exist or 
is unstable, then a smoothed or penalised inverse as in ridge regression or the 
more recent lasso - see Hastie et al (2007) or Meier and Biihlmann (2007) and 
references therein - is commonly used. In this paper we shall not be concerned 
with such inverses. 

In multiple linear regression the aim is to reduce the number of variables. 
Without going into the different methods of finding such subsets of variables, but 
assuming instead one has found a subset X~ where the superscript ~ indicates 
that some variables are omitted, then X = X~ and Xq = Xq in ^ and ^. 

PCR does not leave out variables, but uses weighted sums of all variables, 
with weights allocated according to the contribution to variance of the variables. 
Let 

ZW=Xrfc = [zi...Zfc]. (4) 

Then Z^'^^ is an A^ x fc matrix, with k < p,Tk denotes the matrix which consists 
of the first k eigenvectors of the sample covariance matrix of the predictors, and 
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the i-th column vector z^ is obtained by projecting X onto the i-th eigenvector 
7i of Ffc. In the PCR setting X = Z'*^^ and xo = Zg , and one obtains the 
estimator B and predictor y given by 

B = (zW^Z('^')"'zW^Y=(rJX^Xrfc)"'rJX^Y and 

r = ^^n{rlx^xr,)-'rlx^Y. (5) 

Since zi is a hnear combination of all variables, fc = 1 is often used in PCR, 
for example in Bair et al (2006). This first principal component often does 
not sufficiently summarise the predictors, and then it is important to find an 
appropriate value for k. We return to this choice in Section 3.5.2. 

2.2. Variable Ranking and Canonical Correlations 



In univariate regression the correlation coefficient p oi X and Y is defined by 

cov{X,Y) 



^Jvar[X]var\Y] 



(6) 



where cov refers to the covariance and var to the variance. For regression, this 
coefficient relates the standardised response and predictor variables. 

In multiple linear regression the absolute value of the correlation coefficient 
can be used to order the predictor variables according to the strength of their 
correlation with a univariate response y, so 

[Xi---Xp] -^ [X(i)---X(p)] 

where the vectors xj^) are ordered such that |p(i)| > |p(2)| ■ • • \P(p)\i and p(i) 
denotes the correlation between a response vector y and X(,j) . As pointed out in 
Section 1, the correlation coefficient takes into account one feature vector at a 
time - so considers only the marginals - and cannot account for any interaction 
between the XjS. Further, the correlation coefficients and the above ordering do 
not extend to multivariate responses. 

For a multivariate predictor X and a multivariate response Y a natural 
generalisation of p is the matrix of canonical correlations 

C = I]^'/'I]xYSy'/^ (7) 

which replaces the univariate cov{X, Y) of ^ by the p x q covariance matrix 
T,xY of X and Y, and each of war[X] and uar[y] by their respective variance 
matrices Ex and Sy. The matrix C is commonly used in canonical correlation 
analysis, and contains information about the strength of the relationship of the 
individual variables. For a univariate response Y, the matrix C reduces to a 
p X 1 vector whose entries give rise to an ordering or ranking of the p variables 
of X which makes use of all interactions of the data. In Step 1 of our algorithm 
we show how the matrix C induces a a ranking of the data variables which also 
applies 
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1. to multivariate responses Y, and 

2. to data whose dimension p exceeds the sample size N . 

3. A Regression Algorithm Based on Ranking and Dimension 
Selection 

3.1. The Underlying Latent Variable Model 

The model for a single response vector y is 

y = W'^s + (5, (8) 

where W^ is a q x H matrix, s is an _ff x 1 latent vector with non-Gaussian 
components, H <^ p, and ^ is a q x 1 error vector. Let / denote a proper subset 
oi {1, ...,p} and set |/| ^ m < p. For i E I, the i-th feature of the input vector 
x is modelled by 

Xi=p'[s + T]i, (9) 

where the p^s are iJ-dimensional vectors, the rji are errors, and H < m. Com- 
bining dH]) and dHI, we have 

(10) 

where x, is a m x 1 subvector of x defined by the components of ([9]) and P 
is a TO X i? matrix having pf (i G /) as its rows. The ?7i-dimensional vector 
X* contains the features which are most relevant for prediction and which are 
modelled by a smaller number H of latent variables. Step 1 of the algorithm 
shows how we obtain the vector m, and Step 2 elucidates the choice of H. 

3.2. Variable Ranking with C 

We begin with an inspection of the matrix C given in ([7]) which can be regarded 
as a multivariate correlation coefficient of size p x g. To be able to rank the 
variables of X, we use the fact that the strongest correlation between the X and 
Y variables is given by the largest eigenvalue, ki, of C. 

The first left and right eigenvectors of C, denoted by hi and gi respectively, 
satisfy 

Cgi = Kihi. (11) 

Further the entries of hi and gi contain relative weights for the variables of 
X and Y (see Theorem 3.6, Chapter 3 of Koch (2009)). Next observe that the 
correlation coeflficient p relates the univariate standardised variables X and Y 
via Y/ay = pX/ax, where cry and ax is the standard deviation of Y and X, 
respectively, and both X and Y are assumed to be centred. The multivariate 
analogue of this relationship is 

S^^/^Y = C^S^^/^X or equivalently 

Y = C'^X, (12) 
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where C = TT^I'^CY^^'^ = TT^Y^xy ■ Combining ^ and (HH) wc note that C 
apphes to the sphered data while C appUcs directly to the raw data. Put 

b=-S^^/'Cgi, (13) 

and observe that b is the first (left) canonical covariate vector (see Chapter 3, 
Koch (2009)). 

Ranking of variables is particularly important for HDLSS data, when the 
number of variables exceeds the number of observations. In this case the usual 
estimate of the covariance matrix Hx is not invertible. Assume that X has rank 
r (with r < min(A^,p)). Let 

X = ULV^ (14) 

denote the singular value decomposition of X where U and V arc N x r and 
p X r matrices, respectively, both with orthonormal columns, and L is the r x r 
diagonal matrix with singular values as its entries listed in decreasing order. 

For notational convenience we denote the sample version of C by C. Com- 
bining ([7]) and P^ we obtain 

C = (X^X)-'/' (X^Y,) (YfY,)"'/' = VU^Y, {Y^Y^y'^' , (15) 

where Y^, is the centred Y. Similarly, the sample version b of the vector b in 
(fT5|) is calculated by 

b = ^ (X^X)-'/'agi = ivL-iU^Ye (YfY,)-^/'gi, (16) 

where Cgi = Kihi is the sample version of ([TT|) with ki and gi the first eigen- 
value and right eigenvector respectively. The expressions (jl5|) and (J16p do not 
depend on the relationship between p and N^ so can be equally applied to con- 
ventional data with p < N a,s well as to HDLSS data. 



3. 3. Pursuit of Interesting Dimensions 

Since the 1970s and more particularly since Friedman and Tukey (1974), sub- 
spaces in data are regarded 'uninteresting' if they are random or unstructured, 
and conversely, a projection, or subspace is interesting if it is far from Gaus- 
sian. Following these ideas for a given data structure X, which could denote 
the original data, ranked data or a subset of the variables of the original data, 
we aim to find the subspace of those variables of X which contain interesting 
non-Gaussian structure. The first principal component does not normally con- 
tain enough structure, since it is one-dimensional , while the whole data contain 
structure and randomness. The goal is thus to search for a low-dimensional 
subspace which contains the essential information in the data. 

Two closely related methods. Projection Pursuit and Independent Compo- 
nent Analysis, find non-Gaussian projections in data. See Huber (1985), Jones 
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and Sibson (1987), Friedman (1987) and Hyvarincn, et al (2001) for good ac- 
counts of how to find one or more projections. 

More recently Scholz et al (2004) and Koch and Naito (2007) proposed meth- 
ods for determining the number of these projections or the dimension of the 
subspace. Motivated by their sub-Gaussian metabolite data, the 'optimal' di- 
mension in Scholz et al (2004) is that which results in the highest number of 
independent components with negative kurtosis. Koch and Naito (2007) use 
kurtosis and skewness for their dimension selection method which is generally 
applicable and has a theoretical foundation. For this reason we will apply their 
criterion in the pursuit of the best subspace dimension. Wc briefly review the 
relevant parts of their method and relate their theoretical results in Section l473l 

Koch and Naito's starting point is the observation that the most interesting 
lower-dimensional subspace is that in which the largest deviation from the Gaus- 
sian can be obtained. An appropriate measure, which can be calculated directly 
from the data, is kurtosis. Let Xpj with j = 1, . . . ^N denote p-dimensional ob- 
servations, here regarded as columns of X^. The sample kurtosis (3 is defined 

by ^ ^ 

/?(X) = /3(xp,i,...,Xp.Ar) = max _B(a|xp,i,...,Xp.7v), (17) 



where 



B(a|xp,i,...,Xp,7v) 



J_ v^ ( g (xp^i - Xp ) 



U^ ^ is the unit sphere in W, S is the sample covariance of X, and Xp is 
the sample mean which is zero due to centring of X. For 2 < k < p, put 
2(fc) = Z^'^^A^"^'^ with Z^*^) as in ^ and A^ the covariance matrix of Z'-''\ 
We calculate the sample kurtosis of the Z''^'', denoted by /3fe(Z('')). The most 
non-Gaussian dimension is the H which satisfies 

H = argmax2<,<p J ^^^^(Z^")) - tA , (18) 

where Tk is a certain constant used for bias-adjustment. 

The choice of the most non-Gaussian dimension in Koch and Naito's method 
requires a sequence of subspaces of dimensions I < k < p. Principal component 
analysis (PCA) does provide such a sequence, but PCA is based purely on 
variance, and may therefore exclude variables that are important for regression 
prediction. For this reason, we cannot apply their criterion directly to the k- 
dimcnsional sphered PC data. Step 2 of our algorithm shows how the selection 
is achieved in a regression framework. Details relating to the choice of rj. are 
given in Section [ 



3.4- A Regression Algorithm 

In this section we present our algorithm which consists of three steps. Step 1 
achieves variable ranking which plays an important role in our algorithm and is 
one of the reasons why the phrase supervised can be used. 
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An effective dimension reduction method is applied in Step 2 and prediction 
as described in (O is accomplished in the final Step 3. 

34.1. Step 1 

Let [X Y] denote the predictor and response variables which satify ([9]) and ([8]) . 
Let C and b denote the ranking matrix and ranking vector obtained from [X Y] 
as in P^ and P^ . For to = 2, ... ,p let 

\b,A>\%^>--->\%J 

denote the m largest entries (in absolute value) of b, ranked in decreasing order, 
and define the ?Tt-dimensional ranked data 



^m,l 



T 
^m,N 



(19) 



where the i-th variable is the variable corresponding to the i-th largest entry 
bj. of b. Our notation indicates that the column vectors Xj. are feature vectors 
while the row vectors x^ j, correspond to the N observations. 



3.4.2. Step 2 

For rn ~ 2, . . . ,min(iV, p) apply principal component analysis to X,„. Sphere 
the principal components to obtain the N x m ranked and sphered PC data 

5,„ = X,„rA-i/2^ (20) 

where -/V~^X^X,„ = FAF denotes the spectral decomposition with the eigen- 
values in A arranged in decreasing order, F is the orthogonal matrix whose 
columns are eigenvectors belonging to the elements of A. 

For m and the ranked and sphered PC data Sm calculate the sample kurtosis 

^ —1/2 

/3fe(XmFfeAj. ) for each k < m, with A^ the appropriate covariance matrix, 
and determine the dimension H = H{m) as in (|18|1 . Put 

X„i,H = X„iF//, 

where Th denotes the first H columns of F, and so ^m,H is the matrix which 
consists of the first H principal components of Xm . 

3.4.3. Step 3 

For 771 = 2, ... , inm{N,p) and a typical explanatory variable z g R^ predict the 
g-variate y as in ([3]) from the ranked data X^: 

y^ = ^m^H [X.'^^H^m,HJ X^,^Y = zJ^T H (^F^X^jX,„F/f j fJX^,Y, (21) 

where z„i is a subvector of z containing the first m ranked variables as defined 
in dUl) in Step 1. 
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4. Some Properties 

4.1. Remarks on the Model and Algorithm 

4- 1.1. Remarks on Step 1. 

Our model is based on a set of features X» which are described by X,„ in the 
algorithm. This means obtaining X,„ from X via C is equivalent to finding the 
set of indices I which characterise X* . ^ 

In addition to the ranking with b we also rank the data with the vector hi 
which is the sample version of hi in pT|) : see also (|T6|) . If a distinction between 
these two ranking schemes is required we put 

bi = b and b2 — ;^Cgi — hi. 

Kl 

Although bi and b2 will lead to different submatrices X™, the rest of our 
algorithm proceeds in the same way for both ranking schemes. A comparison 
between the two ranking vectors shows that, apart from the scale factor 1/ki, 

bi contains sphering with (X"^X) , therefore applies to raw or centred data, 
while b2 could be interpreted as applicable to data where sphering is not required 
or appropriate. 

With very large data sets in particularly, it is not clear whether the spher- 
ing component is required, and we therefore propose to use both versions in 
applications. 

Our ranking extends the ranking proposed by Bair et al (2006) which is 
essentially based on the correlation coefficient between the univariate response 
and each feature vector. We will return to these ranking schemes in Section 221 

4- 1-2. Remarks on Step 2. 

The ranked feature set X^ is generally still too large to find the 'best' non- 
Gaussian predictors corresponding to s in p^ . We use dimension reduction 
with PCA to pursue the best non-Gaussian subspace. 

Using the notation of ([20]) . we denote the eigenvalues of iV^^X^jXm by Ai > 
^2 > • • • > Am > 0, and the eigenvectors by F = [y^ ■ ■ •7,„]- ^^ follows that 

N-'X^X^j^ = \,jj .7 = l,...,m, (22) 

and hence 

m H m 

j=l j=l j=H+l 

In Step 2 we choose the cut-off 'i?' of p^ as in Koch and Naito (2007); so H is 
the dimension which yields the most non-Gaussian projection. For this selection 
of H we obtain Xm^H = "^rnXH- Further, by applying P^ to subspaces of 
the ranked rather than the original data, the most non-Gaussian dimension is 
determined from amongst the variables that are important for prediction. 
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4- 1.3. Remarks on Step 3. 

Instead of ([T|) we consider the multivariate regression model 

Y = X„,,hB, + E, (23) 

where B^ is the H x q matrix of coefficients for subsets of the ranked data, and 
E denotes the errors. The estimate of B^ corresponding to ([2]) is 



Br — yX^jjX.mM j ^m,ffY. 



Hence for a datum z, we extract the subvector z^ whose components correspond 
to the indices {ji, ..., jm} obtained in Step 1. The prediction for z via the model 
([23|) is obtained by y = B^r^z„i, which is nothing other than pT|) . Note that 
we can also express y in terms of the eigenvalues and eigenvectors of N^^X.'^'K^ 
as follows: 



1 " 1 
y = ]^ E i^Y^X„a,-7/z,„. (24) 

j=l 3 



4-2. Properties of h 

For univariate responses, so g = 1. the regression model ([1]) reduces to 

y = X/3 + e, (25) 

where y is the A^ x 1 response vector, (3 is the p x 1 coefficient vector, and e is 
the A^ X 1 error vector. In this case, from ([15]), 



a=(X^X)-^/^X^y,(y^y,)-V2^ 

and hence we have 

hi = ^ (X^X)-^/'x^y„ 

V'y^X(X^X)-^X^ye 

where y^ is the centred response. Therefore it follows that 

b = (X'^X) "^^^ hi ex (X'^X) "^ X'^yc = 3- (26) 

It is well known that (3 is characterised as the vector (3 which maximises 
the correlation between y and X/3 (see Section 10.2.1 of Mardia et al, 1979). 
Moreover any scalar multiple of (3 also maximises the correlation. Therefore 
using b is essentially equivalent to using (3 in our variable ranking since a scalar 
multiple does not affect the ranking. For q = 1 we actually use 

bi = 3 and £2 = (X^X)^^^3. 
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Wc compare ranking with b to that used in Bair et al (2006). Our method for 
choosing relevant predictors is based on the full canonical correlation matrix C 
between the predictors and the responses, or the full LSE. In contrast, Bair et al 
(2006) use correlation coefficients or separate LSEs for the univariate response 
and each univariate feature with a model similar to (|10p . This corresponds 
to considering only the marginals, while ignoring interactions, and admits the 
interpretation of a single LSE under the simplified linear model 

y = /3jXj +e for j = l,...,p, 

where e is an error vector with mean zero and variance <j^In- In other words, 
they apply simple regression with Xj as a single predictor, although their un- 
derlying setting is that of multiple regression ([25|) . They use Sj ~ xj"y/||xj||, 
with j = 1, ...,p, for their ranking, which can be obtained as a standardised LSE 
of the /3jS. Bair et al implement their sorting in an analogous way to Step 1 of 
our algorithm, and their threshold, which corresponds to the choice of m in our 
setting, is chosen by cross-validation. 

Next we briefly examine properties of the SjS. Under the usual manipulations 
conditional on x, it follows that 



E[^,] = T^ 



i¥=j 



var\sA = a . 



In particular, this calculation shows that Sj is not unbiased for /3j. On the other 
hand, since bi = [bn ■ ■ ■ bipY = /3, 



E 



'ij 



13 j , and var 



bij 



a^djj, 



where dke is the (fc, £)-th component of (X^X) . Note that b2 is not unbiased, 
but the variance of each of its components is a^. For p fixed, standard asymp- 
totic theory suggests that (X^X) = Op{N'^^) as N grows, and thus bij can 
estimate Pj more accurately than the biased Sj in the selection of relevant pre- 
dictors. It is also intuitively obvious that variable selection should include the 
correlation structure from all predictors. 

A further advantage of our ranking method over the selection of variables 
with s is that the b-ranking is applicable to multivariate response variables, 
while there is no obvious extension of s to a truely multivariate setting. 

Since b takes into account the whole correlation structure, computations are 
more complex and more involved than those resulting in the calculations of the 
SjS. The latter can be carried out very efficiently for any number of features as 
the complexity only grows with the number of variables. Choosing between these 
two ranking schemes therefore represents a compromise between computational 
efRciency and exploitation of the correlation structure. 
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4.3. Details on the Dimension Selection 

For2<k<p and Z^''^ = Z'-'^^A'^.^^'^ as in dH]) of Section 3.3, ^^(2^=)) increases 
with the dimension k. This means that direct use of /?fe(Z('^^) is not useful for 
the selection of the best k. 

Mimicking the derivation of the AIC and using some insight about null struc- 
ture motivate us to consider the behaviour of 



%h{i->) 



under the /c-dimensional standard Gaussian structure (see Sections 3.1 and 3.2 
in Koch and Naito, 2007). Under Gaussian assumptions it is shown that 

-Pk (Z^*^) j -* Tfe in distribution as iV" ^ 00, 
and hence also that 



E 




E[Tk] 



for large N, where Tk is the maximum of a zero mean Gaussian random field 
onW'^-i. 

The quantity E [Tk] cannot be calculated directly, but is estimated via the 
bounds below, here only given for kurtosis and adjusted to our scenario. 

Theorem [Koch and Naito (2007)] 

For each k < min{p, N} 



where 



LBk < E [Tfc] < UBk 



fc-i 

LBk = 2^ t^fc-eAfe_e,p-fc+e(tan 6 

e— 0.e;even 

UBk = LBk + Vo:6E[xp][l-^ie,k)]. 



Since the lower bound LB vanishes rapidly with dimension, Koch and Naito 
continue with upper bound UB. The interested reader is referred to their pa- 
per for details on the lower bound. In the notation of their theorem, 9 = 
cos~^(-\/0.6), ^ is a weighted sum of upper tail probabilities of the beta dis- 
tribution, and Xp denotes a x-distributed random variable with p degrees of 
freedom and p = ( 4 '^) . Good approximations for the values of UBfe are given 
in Koch and Naito (2007), and are denoted by UBfe. Koch and Naito propose to 
use these tabulated values as the constant Tk in p^ . Returning to the notation 
in Section 3.4.2, Step 2, we 
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^-/3fc(X™rfcA;'/')-[/Sfe 

• and take 

H{m) = argmax2<fe<„4, 

as tlie practical dimension selector of the ranked ?Ti-dinicnsional data Sm 
consisting of N samples. 

The theorem and the practical dimension selector show that the dimension 
H{m) is that for which the difference to the Gaussian is maximal. 

4.4- ^ General View of Supervised Principal Components 



In this section wc clarify why our algorithm can be regarded as a generalisation 
of the 'Prediction by supervised principal components' proposed by Bair et al 
(2006). Like our method, theirs starts with X^, which they obtain by their 
simpler variable ranking described in Subsection 4.2. We express the singular 
value decomposition (fT4)l of X^ in terms of the individual eigenvalues and 
eigenvectors and obtain 



X,„ = [mi •• ■UJn]diagl^/(h,...,^/d^^\ 



where, for j = 1, ...,m, the djS, UjS and VjS satisfy 



^m^mVj 



djVj, 



-^m-^iii'^j 



djUj, 



X-rnVj 



/djUj 



(27) 



In equations (6) and (7) of their Section 2.1, Bair et al (2006) describe their 
prediction y''^^ in a multiple regression model such as ([25|) . For a given datum 
Zm, and using our notation, this is essentially 



^spc _ 



(wf y) Zm 



1 



/rfT 



^'1 



/dl 



«fX^y 



/d. 



Vl 



N\i 



y^X„7i7fz™, 



(28) 



where the last equality can be seen using (22) and fTT]) . Comparing ([M|) with 
(PSj) , we see that y'^^'^ corresponds to the case H = I in ([M|) , and the prediction in 
(|24p is g-dimensional, while y'^^'^ is one-dimensional. In this sense our prediction 
is a natural generalisation of the method discussed by Bair et al (2006) to 
multivariate responses, where q > 2. 
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5. Results 

This section reports on the numerical performance of our algorithm and com- 
parisons of our method with that of Bair et al (2006). For the simulated data 
and the real data sets we predict y. To assess and compare the performances of 
the different methods we use the least squares error criterion 



1 A 

-*- Y^ 11^ ,, l|2 



1/2 



LSE(m)=<^-}^||y.-y.|n (29) 

where m is number of variables of the ranked submatrix X^ . 

In Figures 1-5 we will adhere to the following notation on methods: 

1. knbl-pcH: ranking with bi followed by PCR with H = H{m) selected 
components; 

2. knb2-pcH: ranking with b2 followed by PCR with H = H{m) selected 
components; 

3. bhpt-pcH: ranking as in Bair et al (2006) followed by PCR with H = 
H{m) selected components; 

4. bhpt-pcl: ranking as in Bair et al (2006) followed by PCR with the first 
component only; 

5. nr-pcH: PCR with H = H{m) selected components based on the original 
data without ranking. 

Here 'pcH' refers to Steps 2 and 3 of our algorithm, and 'pel' refers to the 
prediction step of Bair et al (2006). We have added two other prediction meth- 
ods, bhpt-pcH and nr-pcH. The first, bhpt-pcH, is a mixture of our method 
and that of Bair et al; it uses ranking as described in Bair et al (2006), but 
then applies Steps 2 and 3 of our algorithm. The last, nr-pcH, is only used in 
Example 3.2.1, and will be described further there. Like bhpt-pcH, it is used to 
assess the advantages of our variable ranking. 

We calculate the predicted values y and the error (P5)) for to > 2. In each case 
we indicate which of the five methods have been used. We show performance 
plots with m on the cc-axis, and LSE on the y-axis. 

5.1. Simulated Data 

Based on model (|10p we generate two sets of data. The first refers to a classical 
multiple regression framework, and the second models a high dimension low 
sample size (HDLSS) general multivariate regression setting. 

For both models we use the 3-dimensional source vector s = [si S2 ss]^ whose 
components are independently distributed as follows: si ~ uniform on[0, 1], S2 ^ 
Exponential with mean 1, and S3 ~ N{0, 1), the standard Gaussian. So if = 3 in 
the model. For the terms 6 and t] we use vectors with independent components 
distributed as 0.5 x iV(0, 1), in q dimensions for S and in m dimensions for 
T]. The values for q and to will be different in the two settings as will be the 
transformations P and W. 
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Example 5.1.1: 

the matrices 



Classical Data with g = l,p = 13 and N — 172. We use 



and 



W^ 



h 

2/3 

3 3 3 

' = [4 - 3 - 



(30) 



2], 



where Id denotes the d y. d identity matrix. The predictor is constructed by 
stacking x-^ = [xj' x|^] where x^ is a 6-dimensional vector, and each of the 
6 components is generated independently as 0.5 x A^(1.5, 1). Since the matrix 
P gives rise to to = 7, the predictor vector has 13 dimensions. We generate 
172 predictors and responses, and combine them to form one data set D = 
{(yi,xi),..., (2/172, X172)}. 

In the simulations we use 100 such data sets Dj. For each data set Dj we 



calculate the predicted values y and the error ((29)) for m = 2, . . . , 13 and each 
of the four prediction methods knbl-pcH, knb2-pcH, bhpt-pcH and bhpt-pcl. 
Figure [1] shows the performance of the four methods for a typical simulation. 




Fig 1. LSE versus dimension for simulated data; knbl-pcH (red), knb2-pcH (blue), bhpt-pcH 
(black dashed) and bhpt-pcl (black). 



The x-axis displays the dimension m of X„, and the y-axis shows the LSE 
as in (P5|) . All 100 simulations show that prediction with H components out- 
performs prediction with the first component only, as seen in this figure. Using 
H components leads to a strong decrease as dimenion increases and then the 
curves flatten out. In some cases, as for knbl-pcH in Figured! the curve has a 
minimum (here at 5) and then increases again. This behaviour is not uncommon 
and shows clearly that performance is not improved with more variables. 

The next figure (Figure [2) examines in more detail the best final dimension 
H for prediction. For each data set Dj and for each method, say Mi, with i = 
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1, . . . , 4, we find the dimension m — m(j, i) and the associated final dimension 
H = H{j, i) which minimises LSE. For bhpt-pcl we equate m and H , since only 
one dimension is selected in the process. Figure [2] shows the number of times 
out of 100 that a particular value H = H{j, i) was determined as the dimension 
which resulted in the minimum LSE. The values for H are shown on the x-axis 
and the y-axis shows the counts. So, for example, the final dimension H = 2 
was found to produce the smallest error in more than 60 of the simulations with 
method knbl-pcH , while bhpt-pcl resulted in smallest error less than 50 times 
for m = H = 2. The first three methods predominantly use 2 or 3 dimensions 
for best prediction. This agrees with our model in which H ~ 3 is used. Indeed, 
the simulations show that 2 out of these three variables are often sufficient for 
best prediction. 




Fig 2. Counts of best prediction versus final dimension H; knbl-pcH (red), knb2-pcH (blue) 
bhpt-pcH (white) and bhpt-pcl (black) . 



We complement these two figures with Table [T] which shows, for each final 
dimension H, how many times out of 100 simulations a particular method re- 
sulted in the smallest LSE. We see that for iJ = 2 knbl-pcH had the smallest 
LSE in 37 simulations, while knb2-pcH only scored 18 times for H = 2. When 
there were ties for the smallest error, each method scored. For this reason the 
totals add up to more than 100. The last column of the table gives totals for 
each method. We see that knbl-pcH performed considerably better than either 
of the two other pcH methods. In none of the simulations did bhpt-pcl perform 
best. For this reason bhpt-pcl is not included in the Table [T] 

Our results show that pcH methods, which are based on more than the first 
component; outperform bhpt-pcl. For the three pcH methods, ranking with bi 
appears to produce much better results and at lower dimensions than the other 
two ranking methods which perform similarly in this classical setting. 

Example 5.1.2: HDLSS Data with g = 7,p = 172 and N = 52. We use the 



imsart-ejs ver. 2008/01/24 file: ejs_2008_278.tex date: July 28, 2008 



Koch & Naito/ Prediction with a select number of components 



17 





2 


3 


4 


5 


6-8 


total 


knbl-pcH 


37 


13 


6 


1 





57 


knb2-pcH 


18 


8 


1 





1 


28 


bhpt-pcH 


20 


9 


3 


2 


3 


37 



Table 1 

Counts of best performances over all methods by dimension H. Ties account for the total of 
more than 100. PCR never performed best. 



model described in Example 5.1.1 to generate the data, but this time N « p. 
Instead of using the 6-dimcnsional vector x^ of the first example, Xi now has 
165 dimensions. The matrix P remains the same, but W is replaced by 



w 



4 -3 -2 

1 -2 

1 -2 

1 -2 



to allow for multivariate responses. 

We carried out 25 simulations and calculated the multivariate predictions 
and the error based on the rankings with bi and b2. Since the ranking of Bair 
et al (2006) applies to univariate responses only, we can not use bhpt-pcl or 
bhpt-pcH. 

The sample size A^ = 52 limits the rank of X^, so we can only consider 
dimensions m < 52. The best dimension selector of Koch and Naito (2007) 
contains values up to 50 dimensions, and for this reason we restrict X^, to 
maximally 50 variables. Figure [3] shows the performance of 2 typical simulations. 
The X-axis shows the dimension m against the LSE for knbl-pcH and knb2-pcH 
on the y-axis. 

Unlike the previous example, for the HDLSS data the performance with b2 is 
either en par or better than that of bi . For this reason we have shown the results 
of two simulations in Figure [31 Most performance curves had a sharp drop (big 
improvement in performance) at a specific dimension m rather than a gradual 
decrease in error. The final dimension H corresponding to the smallest error was 
at most 8 in both methods and all simulations. Table [2] shows results analogous 
to those of Figure [2] for the HDLSS example but this time as percentages. 



6 7 



knbl-pcH 32 
knb2-pcH 36 



24 
36 



12 

4 



12 
12 



12 
12 



Table 2 
Percentage of best prediction by method for each dimension H. 



Table [5] shows that two or three dimensions arc mostly enough for good 
prediction. This outcome is similar to that of the previous example. Closer 
inspection of the table and Figure [3] reveals that ranking with b2 results more 
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Fig 3. LSE versus dimension; knbl-peH (red) and knb2-pcH (blue). 



often in smaller error and at smaller dinienensions than ranking with bi . 
is the opposite of what we observed in Example 5.1.1. 



This 



5.2. Applications to Real Data 

We consider two real data sets, one with multivariate responses and N > p, 
and our final example is a gene expression data set where the response variable 
consists of survival times. These data complement our simulations in that the 
first multivariate data set is classical while the microarray data has the large 
dimension of 24481 and only 78 samples. 

Example 5.2.1: Illicit drug market data with q = 7,p ^ 10 and N = 66. 

The data contain monthly counts of events recorded by units of key health, law 
enforcement and drug treatment agencies in New South Wales, Australia. The 
data were collected over 66 months from January 1997 to June 2002 and consist 
of 17 different features which group into direct and indirect measures of the 
drug market. The data are described in Gilmour and Koch (2004). 

The goal of the present analysis is to predict the 7 indirect measures of the 
drug market from the 10 direct measures. 

The multivariate responses exclude use of bhpt-based ranking methods, in- 
stead we examine the effect of ranking, by comparing the two ranking schemes 
to non-ranked data. The crucial difference between our methods and nr-pcH is 
that Xm of Step 1 of our algorithm which contains the 'best' m variables is 
replaced by the first m variables of X without any sorting being applied, and so 
consist of the variables in the natural order in which they have been collected. 
Steps 2 and 3 of our algorithm remain the same in nr-pcH. 

Figure |4] shows the performance of our two methods and nr-pcH with the 
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dimension on the x-axis, and the LSE for the 7-dimcnsional responses on the 
y-axis. The methods based on ranking have a similar performance, and have 
generally smaller errors than nr-pcH, especially for smaller values of m. When 
all variables are used there is of course no difference in the performance, and 
the same value H, here iJ = 8 is selected in all three methods. 




Fig 4. LSE versus dimension for the illicit drug data with knbl-pcH (red),knb2-pcH (blue) 
and nr-pcH (black). 



Since the number of variables is small we show, in Tablc[3l the final dimension 
H that has been chosen for each of the three methods. We note in Table [3] that 
the final dimension does not always increase monotonically, as for to = 6 and 
knbl-pcH. Higher values of H for fixed to tend to result in lower errors, but due 
to the iterative nature of FastICA which is used in Koch and Naito (2007) to 
select if, it is not always possible to find the mixing matrix A. In such cases we 
first increase the number of iterations, and if this does not produce the desired 
mixing matrix, we decrease H by one. 



m 


2 


3 


4 


5 


6 


7 


8 


9 


10 


knbl-pcH 


2 


2 


4 


4 


3 


5 


6 


5 


8 


knb2-pcH 


2 


3 


4 


5 


5 


5 


5 


8 


8 


nr-pcH 


2 


2 


4 


5 


4 


4 


5 


5 


8 



Table 3 
Dimensions H for the illicit drug data with knbl-pcH, knb2-pcH and nr-pcH. 



Wc propose to use both ranking schemes and to let the data choose which 
method is preferable. In summary the analysis of the illicit drug data demon- 
strates the effect of ranking in decreasing the error and improving performance. 

Example 5.2.2: Breast cancer survival data with q = l,p = 24481 and 
N = 78. Our final example uses microarray data of expression levels of ap- 
proximately 25,000 genes of breast cancer patients. The first column of the data 
represents the time until metastasis (or the time until the patient left the study) ; 
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the second column is 1 if the tumor metastasiscd, and if it did not mctastasise 
within 5 years. There are 44 patients who survived beyond 5 years and 34 who 
did not. 

The data are described in van 't Veer et al (2002) and reference to further 
analyses of these data are given there. Mostly these data have been used for 
classification. As our interest is prediction of a continuous variable, we work 
with the actual survival times - given in months - as the response variables. 
For prediction of the time variable, the distinction between the two classes of 
patients is not relevant in itself. 

Although the time of 5 years to metastasis is of medical interest, the actual 
survival times and their accurate prediction will complement analysis based on 
classification and may lead to a better understanding of the genes that are 
associated with metastasis. 

The response variable is univariate, and we will therefore analyse the data 
using the four methods knbl-pcH, knb2-pcH, bhpt-pcH and bhpt-pcl as in 
Example 5.1.1. Our analysis focusses on two issues: 

• the selection of genes from each method; and 

• the LSE of each method. 

Selection of genes. We work with a total of 24,481 gene expresseion levels 
and 78 patients. All our analyses involve PGA which automatically restricts 
the number of genes to at most 78. Subsequent analysis with ICA and the 
dimension selector of Koch and Naito (2007) yields dimension selection for up 
to 50-dimensional data. For this reason we focus on a final selection of the 50 
most important genes in the analysis. Thus we want to select about 0.2% of the 
variables. 

The sorting of variables used by Bair et al (2006) is independent of the total 
number of variables since it compares one variable at a time with the response 
and then orders all variables. The ranking with bi and b2 is conceptually and 
computationally much more involved. Since we desire a relatively small pro- 
portion of genes, about 0.2% of a very large number, we include a preliminary 
ranking prior to Step 1 of our algorithm. The preliminary step is achieved by 
the following computations. 

1. Divide the data into L disjoint submatriccs X^^l of size N x s. 

2. For each £ = 1,...,L determine the r most relevant variables of X^^l 
obtained from ranking with b. This results in submatriccs XV of size 
N X T. 

3. Combine the pre-ranked submatriccs and obtain 

^[T-rank] ^ T-j^ll] ^[2] ^ ^ ^ ^[L] 

a matrix of size N x Lt, and note that Lt < p. 

4. Use x['^~''°"'^l instead of the original data matrix X as the input matrix 
to Step 1. 
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The advantages of preliminary or r-ranking are two-fold. Genes which are 
not important for prediction can be eliminated early; and computations espe- 
cially of the matrix C in p^ are much more efficient without loss of relevant 
intercorrelation information. 

Computations with the breast cancer data have shown that the choices of 
L, s, r and the initial partitioning into submatrices of size N x s are not 
very crucial, but some preliminary calculations with a range of values for L, s 
and r should be carried out before one settles on specific values for the actual 
analysis. For fixed L and s we partitioned the data in different ways into the 
initial submatrices X^^l and found that the overlap of genes in the resulting 
T-rankcd sets was well over 70% and that of the finally selected 50 genes well 
above 60%. We found that L = 5, s = 5000 and r ~ 200 work well for the breast 
cancer data. The preliminary ranking and ranking of Step 1 of the algorithm 
always employ the same ranking, so both times either bi or b2. 

A comparison of the degree of communality between the different ranking 
methods is given in Table ID This table shows the percentage of common genes 
between the variables selected for the three selection methods with bi , b2 and 
the ranking of Bair et al (2006). We have presented percentage overlaps for the 
1000 variables we obtain in the preliminary ranking, followed by the 200 best 
in Step 1, and the subset of the final 50 genes which are used for prediction in 
Figure El In Bair et al (2006) the best 1000, 200 and 50 are obtained in their 
ranking as a one-step process. In the table, we compare a 1000 genes of one 
method with 1000 genes of the other methods and 50 with 50. 



knbl 


knb2 




1000 


200 


50 


1000 


200 


50 


knb2 
bhpt 


58.6 
17.7 


55 
9.5 


46 

8 


33.8 


15 


14 



Table 4 
Percentage overlap of selected genes for best 1000, 200 and 50 variables for each ranking. 



It is interesting to observe that bi and b2 have 23 out of 50 genes in common 
for the final best genes. Bair et al (2006) 's ranking, on the other hand, appears 
to choose very different genes especially from those selected with bi. In the 
performance plots we will see how this choice of final genes affects the LSE. 

Prediction of survival times. From the best 50 genes for each ranking 
methods we predict the survival time, and we calculate LSE for each of the four 
methods over the range of dimensions m = 2, . . . , 50. The minimmir LSE and 
corresponding H for each method are given in Table [5] 



knbl-pcH knb2-pcH bhpt-pcH bhpt-pcl 
m 50 40 35 W 

H 11 12 13 

LSE(m) 13.84 12.23 17.80 20.56 

Table 5 
Smallest prediction error and best dimension for the breast cancer survival data. 
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The three pcH methods use very similar values for the best _ff, but their 
m- values differ. Here knb2-pcH has the smallest error, this is closely followed 
by knbl-pcH, while the best predictions of the other two methods are clearly 



worse. 




15 20 25 30 35 40 45 



50 



Fig 5. LSE versus dimension; knbl-peH (red), knb2-pcH (blue), bhpt-peH (blaek dashed) and 
bhpt-pcl (black). 



Our final performance results over the best 50 dimensions is shown in Figure 
[5] with the dimension m on the i-axis and the LSE on the y-axis. We note that 
bhpt-pcl is initially comparable to knbl-pcH and knb2-pcH until it reaches its 
best prediction at m = 20 from whence it increases again, while the two new 
competitors continue to decrease. The mixed method bhpt-pcH performs better 
initially than the other methods, but flattens out round m = 20. It does however 
have a clear minimum at ?7i = 35 after which its LSE increases again. Overall 
knbl-pcH and knb2-pcH have lower LSEs. They show fairly similar behaviour; 
knb2-pcH has the smallest LSE and reaches its minimum earlier than knbl-pcH, 
and then increases slightly. 

The results indicate that the 50 best genes were sufficient for prediction of 
the survival times, since all four methods had a minimum in this range of m- 
values and all apart from knbl-pcH showed a clear increase after having passed 
the minimum. The results also show that with the preliminary ranking prior to 
Step 1 of our algorithm our method can successfully be applied to HDLSS data 
with very large dimensions. 

6. Discussion and Conclusion 

This paper proposes a new method of supervised prediction in a regression 
setting which applies to multivariate responses, and to HDLSS problems as 
posed by gene expression data. The method intregrates variable ranking with a 
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novel use of choosing the best number of predictors in PC regression, and extends 
current work by Bair et al (2006). The advantages of our comprehensive ranking 
are seen especially for large numbers of variables. 

We demonstrate the performance of our method for multivariate responses 
in an HDLSS simulation and on real data. We test our approach against that of 
Bair et al (2006) in simulations and on microarray data of breast cancer survival 
times. The results convincingly show that our ranking combined with a careful 
selection of the number of components in PCR outperforms Bair et al (2006)'s 
method. The improved prediction comes however at a cost: The determination 
of the best dimensions, and the use of H PCR components is computationally 
more expensive than PCR with the first component only. 

Our results show the improved accuracy over existing prediction methods and 
open the way for further research into other variable ranking methods which 
could take into account data dependent information. 
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